Supplementary Material about distribution modelling of S. alveolata occurrence around Ireland

1 Summary - Distribution Modelling of Sabellaria alveolata reefs around Ireland

This document provides Supplementary Material to the paper Specific niche requirements of an intertidal ecosystem engineer promote multidecadal stability of its geographical range in the face of climate warming by Firth et al. It details the different steps of the distribution modelling and provides information complementary to the main paper. The document is organised as follows:

  1. Introduction (this section)

  2. Importing R libraries required for analyses

  3. Data Preparation and Variable Selection. This section includes information about the resolution and pre-treatment of environmental variables prior to statistical modelling.

  4. Preliminary exploration of the sensitivity of results to alternative modelling approaches. Comparison of alternative modelling approaches (namely GLM, GAM, RF, BRT, MARS and SVM applied with default settings using the ‘sdm’ package) against a Boosted Regression Tree tuned according to Elith et al. (2008) and fitted using the ‘ggBRT’ package. Note that this section justifies our choice to focus on results from Boosted Regression Trees (BRT), as they yielded better predictive capacity and results consistent with the other approaches fitted with default settings using the ‘sdm’ package. [Elith, J., Leathwick, J.R. and Hastie, T. (2008), A working guide to boosted regression trees. Journal of Animal Ecology, 77: 802-813. https://doi.org/10.1111/j.1365-2656.2008.01390.x]

  5. Presentation of main results from Boosted Regression Tree Presence/Absence Modelling. This section expands on the results presented in the main paper. Note that this section ends with a specific subsection dedicated to spatial autocorrelation in the initial dataset and in model residuals.

  6. This final section details the list of R packages and R session information used to produce this R markdown document.

2 Packages

2.1 Data handling

library(dplyr)
library(purrr)
library(magrittr)
require(readxl) # for the excel_sheets() function

2.2 Analysis

# Collinearity
library(usdm)

# SDM
library(sdm)

# Brt
library(ggBRT)
library(gbm)

# Random Forest
library(randomForest)

# AUC, Thresholding etc...
library(ROCR)
library(PresenceAbsence)

# CV and autocorrelation
library(blockCV)
library(automap)
library(ncf)
library(sp)
library(sf)

2.3 Graphics

library(ggplot2)
library(ggcorrplot)
library(ggridges)
library(gridExtra)
library(viridis)
library(scales)
library(tidyverse)
library(maps)
library(mapdata)
library(fields)
library(rgdal)
library(RColorBrewer)
library(ggpointdensity)

3 Import and Prepare data

Presence / Absence data correspond to all available field surveys post-2013 as presented in the paper. Environmental data were derived from available datasets as described below:

  • Substrate Substrate information, extracted from the EMODNET seabed habitats dataset, corresponds to the Folk sediment triangle and the hierarchy of Folk classification (EMODNET data). Factors are coded as follows [EMODNET levels 6-9 were removed]: 2 = Sand; 3 = Coarse substrate; 4 = Mixed sediment; 5 = Rock & boulders; 11 = Mud; 12 = Sandy Mud; 13 = Muddy Sand.

  • Fetch Wave fetch estimates, i.e. local coastline exposure to ocean waves, are after Burrows et al. (2008). The horizontal resolution of the original data is 100 m.
    Burrows et al. (2008). “Wave exposure indices from digital coastlines and the prediction of rocky shore community structure.” Marine Ecology Progress Series, 353, 1-12.

  • Tidal Amplitude was extracted from OTIS-OSU (https://www.tpxo.net/home). Tides were modelled based on Egbert and Erofeeva (2002) - “Efficient Inverse Modeling of Barotropic Ocean Tides.” Journal of Atmospheric and Oceanic Technology, 19(2), 183–204 - and Egbert et al. (2010) - “Assimilation of altimetry data for nonlinear shallow-water tides: Quarter-diurnal tides of the Northwest European Shelf.” Continental Shelf Research, 30(6), 668–679. Horizontal resolution is ~10 km.

  • Ocean front metrics Ocean front metrics correspond to those described in Miller et al. (2015). Climatological statistics were derived from daily estimates over the 2013-2018 period. The horizontal resolution of the original data is ~2.75 km.
    Miller et al. (2015). “Basking sharks and oceanographic fronts: quantifying associations in the north-east Atlantic.” Functional Ecology, 29, 1099-1109.

  • Air temperature and Wind Speed Air temperature and Wind Speed are from the Météo France ARPEGE model as described in Déqué et al. (1994). The horizontal resolution of the original data is ~7.4 km in longitude and ~11 km in latitude. Climatological statistics were derived from hourly estimates over the 2012-2018 period. Déqué et al. (1994). “The ARPEGE/IFS atmosphere model: a contribution to the French community climate modelling.” Climate Dynamics, 10, 249-266.

  • Salinity and Sea Surface Temperature (SST) were from Amo et al. (2019). Horizontal resolution of the original data is ~3km. Climatological statistics were derived from monthly mean estimates over the 2013-2018 period. Amo et al. (2019). “Product user manual - Atlantic - Iberian Biscay Irish - Ocean Physics Analysis and Forecast Product: IBI_ANALYSIS_FORECAST_PHYS_005_001”. Available from Copernicus Marine Environment Monitoring Service.

  • Stratification Index was derived by combining EMODNET bathymetry with vertically averaged current velocity from Amo et al. (2019). Data resolution is as described above for salinity and SST.
    Amo et al. (2019). “Product user manual - Atlantic - Iberian Biscay Irish - Ocean Physics Analysis and Forecast Product: IBI_ANALYSIS_FORECAST_PHYS_005_001”. Available from Copernicus Marine Environment Monitoring Service.

  • Significant Wave Height was estimated as described in Bricheno and Wolf (2018). The horizontal resolution of the data is 0.083°. Climatological statistics were produced as in Bricheno and Wolf (2018).
    Bricheno and Wolf (2018). “Future wave conditions of Europe, in response to high‐end climate change scenarios.” Journal of Geophysical Research: Oceans 123, 8762-8791.
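For the gridded products above, the climatological statistics (mean, minimum, maximum and percentiles over the study period) follow the general pattern sketched below. This is a minimal illustration on a hypothetical daily SST series, not the actual processing chain:

```r
set.seed(42)
# Hypothetical daily SST series over 2013-2018 with a seasonal cycle
dates <- seq(as.Date("2013-01-01"), as.Date("2018-12-31"), by = "day")
sst <- 11 + 3 * sin(2 * pi * as.numeric(format(dates, "%j")) / 365.25) +
  rnorm(length(dates), sd = 0.5)

# Climatological statistics of the kind used as predictors (e.g. SST_mean, SST_min)
c(SST_mean = mean(sst),
  SST_min  = min(sst),
  SST_p10  = unname(quantile(sst, 0.10)),
  SST_p90  = unname(quantile(sst, 0.90)))
```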

# Define spatial resolution, input and output directories and file names
DataRes <- c('5KM', NA)[1]
input_dir <- 'InputData' 
output_dir <- paste0('./Outputs', DataRes)
sdm_occ_file <- "sdm_occ_n10_test30_thin_RandomCrossVal.Rdata"

# occurrence of Sabellaria & environmental data
sab_env <- read.csv(file.path(input_dir,paste0("sab_env+sub_",DataRes,"data_IE.csv"))) 
rm_row <- nrow(sab_env)
sab_env <- sab_env[!is.na(sab_env$Substrate),]
sab_env <- droplevels(sab_env) 
rm_row <- rm_row - nrow(sab_env)
xtabs(~sab_env$Substrate)
## sab_env$Substrate
##   2   3   4   5  11  12  13 
##  72  66   4  74   6  14 112
sab_env$Substrate <- as.factor(sab_env$Substrate)
# levels(sab_env$Substrate) <- c('Sand','Coarse substrate','Mixed sediment','Rock & boulders','Mud','Sandy Mud','Muddy sand')
names(sab_env['Substrate']) <- "Substrate" # "Substrate_type"
print(paste('Removed', rm_row, 'data points due to lack of substrate information around Connemara coastline'))
## [1] "Removed 5 data points due to lack of substrate information around Connemara coastline"

3.1 Remove collinear environmental variables

# Remove collinear variables
coord_bin <- sab_env %>%
dplyr::select(Longitude_5km, Latitude_5km ) 

# Initial correlation 
sab_env %>%
  dplyr::select(-Sab.ID., -Location.name, -Surveyor.code, - County.number, -County, -Administrative.region, -Latitude, -Longitude, -Co.ordinate.accuracy, -Day, -Month, -Year, - Intertidal.Subtidal, -starts_with("SACFOR"), -Substrate, -FrontGradDens_min, -FrontSide_min, -FrontSide_max, -FrontPersist_min) %>%
  cor(., method = "spearman") %>%
  ggcorrplot(., hc.order = TRUE, type = "upper", lab = FALSE, method = "circle", colors = c("tomato2", "white","springgreen3"))

Figure 3.1:

sab_env <- sab_env %>%
  dplyr::select(
    # Front metrics
    -FrontPersist_min, -FrontPersist_p05, -FrontPersist_p10, -FrontPersist_p90, -FrontPersist_p95, -FrontPersist_max, -FrontPersist_sd,
    -FrontSide_min, -FrontSide_p05, -FrontSide_p10, -FrontSide_p90, -FrontSide_p95, -FrontSide_max, -FrontSide_sd,
    -FrontGradDens_min, -FrontGradDens_p05, -FrontGradDens_p10, -FrontGradDens_p90, -FrontGradDens_p95, -FrontGradDens_max, -FrontGradDens_sd, -FrontGradDens_mean,
    -FrontDist_min, -FrontDist_p10, -FrontDist_p90, -FrontDist_p95, -FrontDist_max, -FrontDist_sd,
    -contains("_p50"),
    # Wave height
    -WaveHeight_p10, -WaveHeight_p90, -WaveHeight_p95, -WaveHeight_p99, -WaveHeight_max,
    # Air temperature
    -AirTemp_min, -AirTemp_p10, -AirTemp_p90, -AirTemp_max, -AirTemp_mean,
    # SST
    -SST_p10, -SST_p90, -SST_max,
    # Other variables
    -MixedLayerDepth, -CloudCover, -SurfCurrent, -Aspect, -SpringChlA,
    -Longitude_5km, -Latitude_5km)

# Check remaining correlation
sab_env %>%
  dplyr::select(-Sab.ID., -Location.name, -Surveyor.code, - County.number, -County, -Administrative.region, -Latitude, -Longitude, -Co.ordinate.accuracy, -Day, -Month, -Year, - Intertidal.Subtidal, -starts_with("SACFOR"), -Substrate) %>%
  cor(., method = "spearman") %>%
  ggcorrplot(., hc.order = TRUE, type = "upper", lab = FALSE, method = "circle", colors = c("tomato2", "white","springgreen3"))

Figure 3.2:

# Check remaining collinearity 
sab_env %>%
  dplyr::select(-Sab.ID., -Location.name, -Surveyor.code, - County.number, -County, -Administrative.region, -Latitude, -Longitude, -Co.ordinate.accuracy, -Day, -Month, -Year, - Intertidal.Subtidal, -starts_with("SACFOR")) %>%
  vif(.) %>% 
  arrange(desc(VIF))
##            Variables       VIF
## 1              Fetch 18.866666
## 2  FrontPersist_mean  9.238758
## 3          WindSpeed  8.507063
## 4           Salinity  7.514083
## 5    WaveHeight_mean  7.254731
## 6     FrontDist_mean  6.725121
## 7            TideAmp  5.645464
## 8           SST_mean  4.024994
## 9      FrontDist_p05  3.771220
## 10           SST_min  2.973670
## 11    FrontSide_mean  2.955355
## 12      StratifIndex  2.692326
## 13         Substrate        NA

3.1.1 Note

Similar results are obtained whether WaveHeight_p95 or WaveHeight_mean is used. Similar results are also obtained whether StratifIndex or SurfCurrent is used, as StratifIndex was calculated according to Crips (1989) as \(SI = \log_{10}\left(\frac{H}{u^{3}}\right)\), where \(H\) is the water depth and \(u\) is the surface current velocity.
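As a minimal illustration of this formula (hypothetical depth and current values, not the actual model inputs):

```r
# Stratification index: SI = log10(H / u^3)
# H = water depth (m), u = surface current velocity (m/s)
stratif_index <- function(H, u) log10(H / u^3)

# Hypothetical example: 50 m depth, 0.5 m/s surface current
stratif_index(H = 50, u = 0.5)  # log10(50 / 0.125) = log10(400), i.e. ~2.6
```

Larger values correspond to deeper, weakly mixed (stratified) waters; smaller values to well-mixed conditions.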

  • Remove fetch

# Check remaining correlation
sab_env %>%
  dplyr::select(-Sab.ID., -Location.name, -Surveyor.code, - County.number, -County, -Administrative.region, -Latitude, -Longitude, -Co.ordinate.accuracy, -Day, -Month, -Year, - Intertidal.Subtidal, -starts_with("SACFOR"), -Fetch, -Substrate) %>%
  cor(., method = "spearman") %>%
  ggcorrplot(., hc.order = TRUE, type = "upper", lab = FALSE, method = "circle", colors = c("tomato2", "white","springgreen3"))

Figure 3.3:

# Check remaining collinearity 
sab_env %>%
  dplyr::select(-Sab.ID., -Location.name, -Surveyor.code, - County.number, -County, -Administrative.region, -Latitude, -Longitude, -Co.ordinate.accuracy, -Day, -Month, -Year, - Intertidal.Subtidal, -starts_with("SACFOR"), -Fetch) %>%
  vif(.) %>% 
  arrange(desc(VIF))
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
##            Variables      VIF
## 1          WindSpeed 8.057407
## 2    WaveHeight_mean 7.142121
## 3           Salinity 7.111227
## 4            TideAmp 5.551628
## 5     FrontDist_mean 5.348088
## 6  FrontPersist_mean 4.970886
## 7      FrontDist_p05 3.711769
## 8           SST_mean 3.676061
## 9     FrontSide_mean 2.932645
## 10           SST_min 2.805090
## 11      StratifIndex 2.691842
## 12         Substrate       NA

3.2 Format dataset with final subset of environmental predictors

# Compute the occurrences from the SACFOR data
sab_env <- sab_env %>%
  dplyr::mutate(occurrence = if_else(SACFOR.numerical > 0, 1, 0))

# Select the predictors
predictors <- sab_env %>%
  # Select only the explicative variables
  dplyr::select(-Sab.ID., -Location.name, -Surveyor.code, - County.number, -County, -Administrative.region, -Latitude, -Longitude, -Co.ordinate.accuracy, -Day, -Month, -Year, - Intertidal.Subtidal, -starts_with("SACFOR"), -Fetch,-occurrence) %>%
  # Extract the names of the variables
  colnames(.)

3.3 Thin absence / presence data

par(mfrow = c(1,2))
# Plot occurrence data
plot(sab_env$Longitude,sab_env$Latitude, pch= 19,cex=1.2,col='darkgreen', axes = F, xlab = 'Longitude', ylab = 'Latitude', type = 'n', main = 'All occurrence data')
maps::map('world', add=T, fill=T, col='beige', bg='steelblue'); axis(1); axis(2)
points(sab_env[which(sab_env$occurrence==1),]$Longitude,sab_env[which(sab_env$occurrence==1),]$Latitude, pch= 19,cex=1.2,col='darkgreen')
points(sab_env[which(sab_env$occurrence==0),]$Longitude,sab_env[which(sab_env$occurrence==0),]$Latitude, pch= 4,cex=1.2,col='blue' )
legend('topleft', legend = c('Pres.','Abs.'), pch = c(19,4), col = c('darkgreen','blue'))

# Define Spatial Bins Based on Raster
if(FALSE){load(file.path(input_dir,'IE_EnvRasterStack_2.75km.RData'))
sab_env$Lon_bin <- cut(sab_env$Longitude, seq(xmin(raStack), xmax(raStack),res(raStack)[1] ) )
sab_env$Lat_bin <- cut(sab_env$Latitude , seq(ymin(raStack), ymax(raStack),res(raStack)[2] ) )
}else{
sab_env$Lon_bin <- as.factor(as.character(coord_bin$Longitude_5km))
sab_env$Lat_bin <- as.factor(as.character(coord_bin$Latitude_5km))
}

temp <- aggregate(sab_env$occurrence, FUN = mean,  by = list(Lon_bin=sab_env$Lon_bin, Lat_bin=sab_env$Lat_bin))
temp$x[temp$x > 0] = 1
summary(temp)
##     Lon_bin       Lat_bin          x         
##  -10.115:  5   53.235 : 13   Min.   :0.0000  
##  -6.063 :  5   53.135 :  8   1st Qu.:0.0000  
##  -8.463 :  5   55.185 :  8   Median :0.0000  
##  -8.963 :  5   55.235 :  7   Mean   :0.4205  
##  -9.863 :  5   52.135 :  6   3rd Qu.:1.0000  
##  -10.065:  4   52.235 :  6   Max.   :1.0000  
##  (Other):147   (Other):128
temp <- merge(x = sab_env, y = temp)
temp$dup <- duplicated(temp[,c('Lon_bin','Lat_bin')])
sab_env_thin <- temp[!temp$dup,!names(temp) %in% c('dup','Lon_bin','Lat_bin','occurrence')]
names(sab_env_thin)[ncol(sab_env_thin)] <- 'occurrence'

print(paste(nrow(sab_env) - nrow(sab_env_thin), 'occurrence records removed through thinning.'))
## [1] "172 occurrence records removed through thinning."
plot(sab_env_thin$Longitude,sab_env_thin$Latitude, pch= 19,cex=1.2,col='darkgreen', axes = F, xlab = 'Longitude', ylab = 'Latitude', type = 'n', main = 'Thinned occurrence data' )
maps::map('world', add=T, fill=T, col='beige', bg='steelblue'); axis(1); axis(2)
points(sab_env_thin[which(sab_env_thin$occurrence==1),]$Longitude,sab_env_thin[which(sab_env_thin$occurrence==1),]$Latitude, pch= 19,cex=1.2,col='darkgreen')
points(sab_env_thin[which(sab_env_thin$occurrence==0),]$Longitude,sab_env_thin[which(sab_env_thin$occurrence==0),]$Latitude, pch= 4,cex=1.2,col='blue' )
points(temp[which(temp$dup),]$Longitude,temp[which(temp$dup),]$Latitude, pch= 3,cex= .6,col='red' )
legend('topleft', legend = c('Pres.','Abs.','Thinning'), pch = c(19,4,3), col = c('darkgreen','blue','red'))

Figure 3.4:

# Summary of the dataset after thinning
summary(sab_env_thin)
##     Sab.ID.         Location.name Surveyor.code   County.number  
##  1-06   :  1   Doonbeg     :  2   Min.   :1.000   Min.   : 1.00  
##  10-05  :  1   Glasdrumman :  2   1st Qu.:1.000   1st Qu.: 8.00  
##  10-10  :  1   Inch        :  2   Median :1.000   Median :11.00  
##  10-102 :  1   Abbey Island:  1   Mean   :1.222   Mean   :11.18  
##  10-105 :  1   Annalong    :  1   3rd Qu.:1.000   3rd Qu.:15.25  
##  10-110 :  1   Ardfry Ho   :  1   Max.   :2.000   Max.   :18.00  
##  (Other):170   (Other)     :167                                  
##      County           Administrative.region    Latitude    
##  Donegal:32   Northern Ireland   : 30       Min.   :51.48  
##  Galway :28   Republic of Ireland:146       1st Qu.:52.31  
##  Cork   :21                                 Median :53.40  
##  Down   :18                                 Mean   :53.53  
##  Kerry  :13                                 3rd Qu.:54.58  
##  Antrim :11                                 Max.   :55.29  
##  (Other):53                                                
##    Longitude       Co.ordinate.accuracy      Day            Month       
##  Min.   :-10.394   < 10m  :114          Min.   : 1.00   Min.   : 1.000  
##  1st Qu.: -9.241   < 1km  : 21          1st Qu.: 4.00   1st Qu.: 2.000  
##  Median : -8.354   < 100m : 17          Median :12.00   Median : 5.000  
##  Mean   : -8.041   <100m  : 17          Mean   :12.52   Mean   : 4.768  
##  3rd Qu.: -6.644   <10m   :  4          3rd Qu.:19.00   3rd Qu.: 7.000  
##  Max.   : -5.439   < 100m :  1          Max.   :31.00   Max.   :11.000  
##                    (Other):  2          NA's   :27      NA's   :25      
##       Year      Intertidal.Subtidal SACFOR.aplhabetical SACFOR.numerical
##  Min.   :2013   Intertidal:176      N      :108         Min.   :0.000   
##  1st Qu.:2013                       C      : 28         1st Qu.:0.000   
##  Median :2014                       A      : 12         Median :0.000   
##  Mean   :2015                       P      : 12         Mean   :1.568   
##  3rd Qu.:2016                       F      :  9         3rd Qu.:4.000   
##  Max.   :2018                       R      :  5         Max.   :6.000   
##                                     (Other):  2                         
##      Fetch          Salinity        SST_mean        SST_min     
##  Min.   : 1928   Min.   :28.43   Min.   :10.56   Min.   :4.086  
##  1st Qu.:10988   1st Qu.:33.02   1st Qu.:11.29   1st Qu.:5.902  
##  Median :14976   Median :33.99   Median :11.59   Median :6.154  
##  Mean   :14724   Mean   :33.56   Mean   :11.53   Mean   :6.178  
##  3rd Qu.:19204   3rd Qu.:34.58   3rd Qu.:11.82   3rd Qu.:6.526  
##  Max.   :26011   Max.   :35.05   Max.   :12.10   Max.   :7.619  
##                                                                 
##   StratifIndex     WindSpeed         TideAmp       WaveHeight_mean
##  Min.   :1.849   Min.   : 9.256   Min.   :0.4032   Min.   :0.542  
##  1st Qu.:3.630   1st Qu.:11.251   1st Qu.:1.6750   1st Qu.:1.085  
##  Median :4.086   Median :11.732   Median :1.9313   Median :1.298  
##  Mean   :4.010   Mean   :11.770   Mean   :1.9088   Mean   :1.420  
##  3rd Qu.:4.435   3rd Qu.:12.383   3rd Qu.:2.2560   3rd Qu.:1.898  
##  Max.   :6.024   Max.   :13.744   Max.   :3.7545   Max.   :2.698  
##                                                                   
##  FrontDist_mean   FrontDist_p05   FrontPersist_mean   FrontSide_mean   
##  Min.   : 4.678   Min.   :1.175   Min.   :3.390e-06   Min.   :-0.7076  
##  1st Qu.: 7.382   1st Qu.:2.140   1st Qu.:6.620e-03   1st Qu.:-0.3791  
##  Median :10.592   Median :2.686   Median :1.245e-02   Median :-0.2839  
##  Mean   :10.678   Mean   :2.958   Mean   :1.391e-02   Mean   :-0.2738  
##  3rd Qu.:13.568   3rd Qu.:3.571   3rd Qu.:2.118e-02   3rd Qu.:-0.1748  
##  Max.   :18.564   Max.   :6.634   Max.   :3.615e-02   Max.   : 0.1534  
##                                                                        
##  Substrate   occurrence    
##  2 :48     Min.   :0.0000  
##  3 :49     1st Qu.:0.0000  
##  4 : 4     Median :0.0000  
##  5 :34     Mean   :0.4205  
##  11: 3     3rd Qu.:1.0000  
##  12: 8     Max.   :1.0000  
##  13:30
# Number of presence and absence records before thinning
print(paste("Before thinning we had", table(sab_env$occurrence)[1], "absences and", table(sab_env$occurrence)[2], "occurrences"))
## [1] "Before thinning we had 174 absences and 174 occurrences"
# Number of presence and absence records after thinning
print(paste("After thinning we have", table(sab_env_thin$occurrence)[1], "absences and", table(sab_env_thin$occurrence)[2], "occurrences left"))
## [1] "After thinning we have 102 absences and 74 occurrences left"
# From now on, we will use the thinned data
sab_env <- sab_env_thin

4 Comparison of alternative models (GLM, GAM, RF, BRT, MARS, SVM) using default settings and the ‘sdm’ package

4.1 Performance of different models

set.seed(64)

sab_sdm_data <- sdm::sdmData(occurrence~., train=sab_env %>% select(occurrence, !!predictors))

sdm_occ <- sdm(occurrence~.,data=sab_sdm_data, methods=c('glm','gam','gbm','rf','svm', "mars"),test.percent=30, n=10, replication = 'cv', ncores = 2)
  
save(sdm_occ, file = file.path(output_dir, sdm_occ_file))
load(file.path(output_dir,sdm_occ_file))

# Summary of the sdm
sdm_occ
## class                                 : sdmModels 
## ======================================================== 
## number of species                     :  1 
## number of modelling methods           :  6 
## names of modelling methods            :  glm, gam, brt, rf, svm, mars 
## replicate.methods (data partitioning) :  cross_validation 
## number of replicates (each method)    :  10 
## toral number of replicates per model  :  50 (per species) 
## ------------------------------------------
## model run success percentage (per species)  :
## ------------------------------------------
## method          occurrence       
## ---------------------- 
## glm        :        100   %
## gam        :        100   %
## brt        :        100   %
## rf         :        100   %
## svm        :        100   %
## mars       :        100   %
## 
## ###################################################################
## model Mean performance (per species), using test dataset (generated using partitioning):
## -------------------------------------------------------------------------------
## 
##  ## species   :  occurrence 
## =========================
## 
## methods    :     AUC     |     COR     |     TSS     |     Deviance 
## -------------------------------------------------------------------------
## glm        :     0.84    |     0.62    |     0.68    |     1.88     
## gam        :     0.81    |     0.58    |     0.63    |     12.04    
## brt        :     0.88    |     0.67    |     0.74    |     1.06     
## rf         :     0.91    |     0.73    |     0.75    |     0.74     
## svm        :     0.91    |     0.73    |     0.78    |     0.91     
## mars       :     0.82    |     0.59    |     0.64    |     8.53
# ROC
roc(sdm_occ)

Figure 4.1:

4.2 Importance of variables for alternative methods using SDM package

Variable importance (Naimi & Araújo 2016, Ecography):

  • One method calculates the improvement in model performance when each variable is included, relative to when it is excluded, through a cross-validation procedure.
  • Another method is a randomisation procedure that measures the correlation between the model predictions and the predictions obtained when the variable under investigation is randomly permuted. If the contribution of a variable to the model is high, the predictions are strongly affected by the permutation and the correlation is low. Therefore, ‘1 – correlation’ can be used as a measure of variable importance.

Ref : https://doi.org/10.1111/ecog.01881
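The second, permutation-based measure can be sketched as follows. This is a minimal illustration on simulated data using a GLM; the internals of the ‘sdm’ package differ:

```r
set.seed(1)
# Simulated data: the response depends strongly on x1, weakly on x2
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(2 * d$x1 + 0.2 * d$x2))
m <- glm(y ~ x1 + x2, data = d, family = binomial)

# Permutation importance: 1 - cor(predictions, predictions with one predictor
# shuffled); higher values indicate more influential variables
perm_importance <- function(model, data, var) {
  pred <- predict(model, newdata = data, type = "response")
  data[[var]] <- sample(data[[var]])
  pred_perm <- predict(model, newdata = data, type = "response")
  1 - cor(pred, pred_perm)
}
sapply(c("x1", "x2"), function(v) perm_importance(m, d, v))
# x1 receives a much higher importance than x2
```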

# Variables importance
#---------------------

# Get the info on each runs
runs <- attributes(sdm_occ)$run.info

# Get the contribution of the variables of the different runs
var_contrib <- as.list(1:nrow(runs)) %>%
  # Retrieve the importance of the variables for each run
  purrr::map(.,~getVarImp(sdm_occ,id=.,wtest='test.dep')) %>%
  # Format this output do get a df
  purrr::map(., ~ .x %>% attributes(.) %>% extract2("varImportance")) %>%
  # Bind the details on the corresponding run
  purrr::map2_dfr(., as.list(1:nrow(runs)), ~ as.data.frame(.x) %>% mutate(method = runs[.y, "method"],replicationID  = runs[.y, "replicationID"]))

contrib <- var_contrib %>%
  group_by(method, variables) %>%
  summarise_at(vars("corTest","AUCtest"), funs('mean' = mean(.), 'sd' = sd(.), 'se' = sqrt(var(., na.rm = T)/length(.)))) %>%
  ungroup()

# AUCtest
#--------

var_order <- contrib %>%
  group_by(variables) %>%
  summarise(AUCtest_mean = mean(AUCtest_mean)) %>%
  arrange(desc(AUCtest_mean)) %>%
  pull(variables)

contrib <- contrib %>%
  mutate(variables = factor(variables, levels = var_order, ordered = TRUE))

p_AUC <- ggplot(contrib, aes(x= variables, y = AUCtest_mean, col = method, dodge = method)) +
  geom_pointrange(aes(ymin = AUCtest_mean - AUCtest_sd, ymax = AUCtest_mean + AUCtest_sd), position = position_dodge(width = 0.4)) +
  xlab("") + ylab("Mean (± sd) variable contribution (AUC-based)") +
  scale_color_manual(values = c("#404040","#bababa","lightcyan3","darkslateblue","orangered4","darkgoldenrod3"), "Method") +
  theme(axis.text.x = element_text(face="bold", color="black", 
                           size=12, angle=45, vjust = 1, hjust = 1),
        axis.text.y = element_text( color="black", 
                           size=12),
        axis.ticks = element_blank())

# CorTest
#--------

var_order <- contrib %>%
  group_by(variables) %>%
  summarise(corTest_mean = mean(corTest_mean)) %>%
  arrange(desc(corTest_mean)) %>%
  pull(variables)

contrib <- contrib %>%
  mutate(variables = factor(variables, levels = var_order, ordered = TRUE))

p_COR <- ggplot(contrib, aes(x= variables, y = corTest_mean, col = method, dodge = method)) +
  geom_pointrange(aes(ymin = corTest_mean - corTest_sd, ymax = corTest_mean + corTest_sd), position = position_dodge(width = 0.4)) +
  scale_color_manual(values = c("#404040","#bababa","lightcyan3","darkslateblue","orangered4","darkgoldenrod3"), "Method") +
  xlab("") + ylab("Mean (± sd) variable contribution (Correlation-based)") +
  theme(axis.text.x = element_text(face="bold", color="black", 
                           size=12, angle=45, vjust = 1, hjust = 1),
        axis.text.y = element_text( color="black", 
                           size=12),
        axis.ticks = element_blank())

grid.arrange(p_AUC,p_COR, ncol = 1)

Figure 4.2:

# AUCtest
#--------

var_order <- contrib %>%
  group_by(variables) %>%
  summarise(AUCtest_mean = mean(AUCtest_mean)) %>%
  arrange(desc(AUCtest_mean)) %>%
  pull(variables)

contrib <- contrib %>%
  mutate(variables = factor(variables, levels = var_order, ordered = TRUE))

p_AUC <- ggplot(contrib, aes(x= variables, y = AUCtest_mean, col = method, dodge = method)) +
  geom_pointrange(aes(ymin = AUCtest_mean - AUCtest_se, ymax = AUCtest_mean + AUCtest_se), position = position_dodge(width = 0.4)) +
  xlab("") + ylab("Mean (± se) variable contribution (AUC-based)") +
  scale_color_manual(values = c("#404040","#bababa","lightcyan3","darkslateblue","orangered4","darkgoldenrod3"), "Method") +
  theme(axis.text.x = element_text(face="bold", color="black", 
                           size=12, angle=45, vjust = 1, hjust = 1),
        axis.text.y = element_text( color="black", 
                           size=12),
        axis.ticks = element_blank())

# CorTest
#--------

var_order <- contrib %>%
  group_by(variables) %>%
  summarise(corTest_mean = mean(corTest_mean)) %>%
  arrange(desc(corTest_mean)) %>%
  pull(variables)

contrib <- contrib %>%
  mutate(variables = factor(variables, levels = var_order, ordered = TRUE))

p_COR <- ggplot(contrib, aes(x= variables, y = corTest_mean, col = method, dodge = method)) +
  geom_pointrange(aes(ymin = corTest_mean - corTest_se, ymax = corTest_mean + corTest_se), position = position_dodge(width = 0.4)) +
  scale_color_manual(values = c("#404040","#bababa","lightcyan3","darkslateblue","orangered4","darkgoldenrod3"), "Method") +
  xlab("") + ylab("Mean (± se) variable contribution (Correlation-based)") +
  theme(axis.text.x = element_text(face="bold", color="black", 
                           size=12, angle=45, vjust = 1, hjust = 1),
        axis.text.y = element_text( color="black", 
                           size=12),
        axis.ticks = element_blank())

grid.arrange(p_AUC,p_COR, ncol = 1)

Figure 4.3:


4.3 Optimised settings for BRT models (based on Elith et al. 2008)

set.seed(64)

# Number of CV folds
n_folds <- 10

# Tree complexity (maximum level of interactions fitted by each tree)
tc <- 5

# Learning rate (shrinkage applied to each tree's contribution; smaller values
# require more trees but generally improve predictive performance)
lr <- 0.001

# Bag fraction (proportion of data randomly drawn at each boosting step)
bf <- 0.7

4.4 Performance of BRT models on random cross-validation (for comparison with ‘sdm’ package results)

# Fit BRT using 10-fold random cross-validation

brt_rand <- gbm.step(data=sab_env, gbm.x = predictors, 
                    gbm.y = which(names(sab_env) == "occurrence"),
                    family = "bernoulli", 
                    tree.complexity = tc,
                    learning.rate = lr, 
                    bag.fraction = bf, 
                    prev.stratify = TRUE,
                    n.folds = n_folds,
                    keep.fold.models = TRUE, 
                    keep.fold.vector = TRUE, 
                    keep.fold.fit = TRUE,
                    verbose = FALSE)
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for occurrence and using a family of bernoulli 
## Using 176 observations and 12 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.3609 
## tolerance is fixed at  0.0014 
## now adding trees...

Figure 4.4:

4.5 Performance

# Extract the performance of the models
perf_rand <- ggPerformance(brt_rand)

perf_rand
##                      Model 1
## Total.Deviance     1.3608766
## Residual.Deviance  0.3395770
## Correlation        0.9300522
## AUC                0.9956000
## Per.Expl          75.0471868
## cvDeviance         0.7269154
## cvCorrelation      0.7534021
## cvAUC              0.9202800
## cvPer.Expl        46.5847687

Note that the BRT fitted with optimised parameters (i.e. learning rate and tree complexity) yields similar, and slightly better, performance than any of the six methods tested with the ‘sdm’ package using default settings. For instance, cross-validation AUC is 0.92 with the optimised BRT model, as opposed to 0.91 for the best-performing methods (i.e. RF and SVM using the ‘sdm’ package). Thus, the main results in the paper rely only on optimised BRT models, with the learning rate set at 0.001 and tree complexity at 5.


5 Main Results from Boosted Regression Tree using Spatial blocks for cross-validation


5.1 Spatial autocorrelation (SAC) and autocorrelation range in predictors

# Keep selected predictors and remove "Substrate" (as it is a factor) from Raster Stack
load(file.path(input_dir,'IE_EnvRasterStack_2.75km.RData'))
raStack <- subset(raStack,predictors)
raStack_variog <- dropLayer(raStack, "Substrate")

# Number of sample points for fitting the variograms
n_samp <- 5000

sac_envir <- spatialAutoRange(rasterLayer = raStack_variog,
                        sampleNumber = n_samp,
                        doParallel = TRUE,
                        showPlots = TRUE,
                        plotVariograms = FALSE)
# Plot the autocorrelation range
sac_envir$plots$barchart

Figure 5.1:

This is the range over which observations can be considered independent. It is determined by constructing empirical variograms: a variogram model is fitted to each raster layer using 5000 sample points. The plotted block size is based on the median of the spatial autocorrelation ranges. Colours correspond to the first layer of the raster stack, which is Salinity.

# Variogram-plotting function, modified from the automap package to allow a custom title
source('CustomFunctions/plot.autofitVariogram.R')

for( i in 1:length(sac_envir$variograms)){
  plot.autofitVariogram(sac_envir$variograms[[i]],
                        main = names(raStack_variog)[i])
}

Figure 5.2:

5.2 Generate Spatial Blocks based on environmental autocorrelation range

# Create an sf (simple features) point object
sp_df <- st_as_sf(sab_env, coords = c("Longitude", "Latitude"), crs = crs(raStack_variog))

spBlock1 <- spatialBlock(speciesData = sp_df,
  # To account for prevalence (evenly distributed records between train and test folds)
  species = 'occurrence',
  # Range for spatial blocks is based on spatial autocorrelation range in environmental data
  theRange = sac_envir$range,
  # Generate 10 folds for cross validation
  k = n_folds,
  # Blocks assignment to folds
  selection = "random",
  biomod2Format = TRUE,
  progress = TRUE,
  verbose = TRUE,
  # Argument for visualisation
  showBlocks = TRUE,
  rasterLayer = raStack_variog[["WaveHeight_mean"]])
## The best folds was in iteration 18:
##    train_0 train_1 test_0 test_1
## 1       94      65      8      9
## 2       82      71     20      3
## 3       95      70      7      4
## 4       97      65      5      9
## 5       93      71      9      3
## 6       93      67      9      7
## 7       94      65      8      9
## 8       88      67     14      7
## 9       92      60     10     14
## 10      90      65     12      9
# Number of data per fold
table(spBlock1$foldID)
## 
##  1  2  3  4  5  6  7  8  9 10 
## 17 23 11 14 12 16 17 21 24 21
# Number of presence and absence by folds
data.frame(pa = sp_df$occurrence, folds = spBlock1$foldID) %>%
  group_by(folds, pa) %>%
  summarise(n=n()) %>%
  spread(pa,n)
## # A tibble: 10 x 3
## # Groups:   folds [10]
##    folds   `0`   `1`
##    <int> <int> <int>
##  1     1     8     9
##  2     2    20     3
##  3     3     7     4
##  4     4     5     9
##  5     5     9     3
##  6     6     9     7
##  7     7     8     9
##  8     8    14     7
##  9     9    10    14
## 10    10    12     9
spBlock1$plots + 
  geom_sf(data = sp_df, aes(color = as.factor(occurrence)), alpha = 0.7) +
  scale_color_manual(values = c("black", "deepskyblue"), name = "Occurrence")


5.3 Fit BRT using spatial blocks for cross-validation

set.seed(64)

# Fit BRT with spatial folds for cross-validation defined from 'spBlocks'
brt_review_block1 <- gbm.step(data=sab_env, gbm.x = predictors, 
                    gbm.y = which(names(sab_env) == "occurrence"),
                    family = "bernoulli", 
                    tree.complexity = tc,
                    learning.rate = lr, 
                    bag.fraction = bf, 
                    fold.vector =  spBlock1$foldID,
                    n.folds = n_folds,
                    keep.fold.models = TRUE, 
                    keep.fold.vector = TRUE, 
                    keep.fold.fit = TRUE,
                    verbose = FALSE)
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for occurrence and using a family of bernoulli 
## Using 176 observations and 12 predictors 
## loading user-supplied fold vector 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.3609 
## tolerance is fixed at  0.0014 
## now adding trees...


5.4 Performance

# Extract the performance of the models
perf_review_block1 <- ggPerformance(brt_review_block1)

perf_review_block1
##                      Model 1
## Total.Deviance     1.3608766
## Residual.Deviance  0.4628235
## Correlation        0.8977138
## AUC                0.9883000
## Per.Expl          65.9907777
## cvDeviance         0.9140186
## cvCorrelation      0.6252818
## cvAUC              0.8441800
## cvPer.Expl        32.8360376
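As a consistency check, the cross-validated percentage of deviance explained reported above follows directly from the total and cross-validated deviances:

```latex
D^2_{cv} \;=\; 100 \times \left( 1 - \frac{\bar{D}_{cv}}{D_{tot}} \right)
       \;=\; 100 \times \left( 1 - \frac{0.9140}{1.3609} \right)
       \;\approx\; 32.8\%
```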

5.5 Initial and residual SAC

# Plot residuals of the brt model
resid_df <- data.frame(fitted = brt_review_block1$fitted, residuals = brt_review_block1$residuals) %>%
  bind_cols(sab_env, .)

ggplot(data = resid_df, aes(x = as.factor(occurrence), y = fitted)) +
  geom_violin(aes(fill = as.factor(occurrence)), alpha = 0.7) +
  geom_boxplot(width = 0.05)  +
  scale_fill_manual(values = c("#404040","#ca0020"), guide = FALSE) +
  xlab("Observation") + ylab("Fitted")

Figure 5.3:

# Initial SAC
sac_init <- correlog(x=resid_df$Longitude,
               y=resid_df$Latitude,
               z=resid_df$occurrence,
               increment=5,
               resamp=1000,
               latlon=TRUE)
plot(sac_init, ylab = "Moran's I", main = "Raw occurrences")

Figure 5.4:
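The correlograms are based on Moran's I. For \(n\) observations \(z_i\) with mean \(\bar{z}\), and binary weights \(w_{ij}\) indicating whether the pair \((i, j)\) falls within a given 5 km distance class (from `increment = 5` with `latlon = TRUE`):

```latex
I \;=\; \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot
        \frac{\sum_{i}\sum_{j} w_{ij}\,(z_i - \bar{z})(z_j - \bar{z})}
             {\sum_{i} (z_i - \bar{z})^2}
```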

# Residual sac
sac_resid <- correlog(x=resid_df$Longitude,
               y=resid_df$Latitude,
               z=resid_df$residuals,
               increment=5,
               resamp=1000,
               latlon=TRUE)
plot(sac_resid, ylab = "Moran's I", main = "Residuals")

Figure 5.5:

# Initial variogram
coordinates(resid_df) <- ~ Longitude + Latitude
proj4string(resid_df) = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0")

vario_init <- autofitVariogram(occurrence~1, resid_df)

plot.autofitVariogram(vario_init, main = "Raw occurrences")

Figure 5.6:

# Residual variogram
vario_resid <- autofitVariogram(residuals~1, resid_df)

plot.autofitVariogram(vario_resid, main = "Residuals")

Figure 5.7:

5.6 Importance of the variables and shape of the responses

# Influence of the variables
ggInfluence(brt_review_block1, col.bar = "black")

Figure 5.8:

##                      rel.inf
## WaveHeight_mean   26.8086304
## TideAmp           18.6629198
## StratifIndex      14.3900328
## Substrate         13.0702817
## SST_min            7.4195160
## WindSpeed          6.3518517
## FrontSide_mean     4.2239418
## SST_mean           2.6658032
## Salinity           2.0351637
## FrontDist_p05      1.7768810
## FrontPersist_mean  1.6058720
## FrontDist_mean     0.9891058
# Fitted values
ggPDfit(brt_review_block1,rug = T, smooth = TRUE, se = FALSE)

Figure 5.9:

# Fitted values
ggPD(brt_review_block1,rug = T)

Figure 5.10:

5.7 Fitted functions with bootstrap for confidence interval

brt_occ.prerun <- plot.gbm.4list(brt_review_block1)

brt_occ.boot <- gbm.bootstrap.functions(brt_review_block1, list.predictors=brt_occ.prerun, n.reps= 1000)

save(brt_review_block1, brt_occ.prerun, brt_occ.boot, file = file.path(output_dir,"brt_occ_SpBlock1_bootstrap.Rdata"))
load(file.path(output_dir,"brt_occ_SpBlock1_bootstrap.Rdata"))

source('CustomFunctions/ggPR_boot.plot.R')

# Draw partial dependency plots for the 4 most influential predictors

p1 <- ggPD_boot(brt_review_block1,
                predictor = "WaveHeight_mean",
                list.4.preds=brt_occ.prerun,
                booted.preds=brt_occ.boot$function.preds, 
                n.plots = 1, rug = T,
                cex.line=1, alpha.dot=0.4, type.ci = "ribbon", alpha.ci=0.4,
                col.ci = "#bdbdbd", col.line = "#636363",
                y.label = "", x.label = "Wave Height")

Figure 5.11:

p2 <- ggPD_boot(brt_review_block1,
                predictor = "TideAmp",
                list.4.preds=brt_occ.prerun,
                booted.preds=brt_occ.boot$function.preds, 
                n.plots = 1, rug = T,
                cex.line=1, alpha.dot=0.4, type.ci = "ribbon", alpha.ci=0.4,
                col.ci = "#bdbdbd", col.line = "#636363",
                y.label = "", x.label = "Tidal Amplitude")

Figure 5.12:

p3 <- ggPD_boot(brt_review_block1,
                predictor = 'StratifIndex',
                list.4.preds=brt_occ.prerun,
                booted.preds=brt_occ.boot$function.preds, 
                n.plots = 1, rug = T,
                cex.line=1, alpha.dot=0.4, type.ci = "ribbon", alpha.ci=0.4,
                col.ci = "#bdbdbd", col.line = "#636363",
                x.label = "Stratification Index", y.label = "")

Figure 5.13:

## INFLUENCE PLOT FOR SUBSTRATE
substrate_summary <- (apply( brt_occ.boot$function.preds[1:7,12,], 1,FUN = function(x) c('mean' = mean(x), 'se' = sqrt(var(x, na.rm = T)/length(x))) ))

substrate_summary <- data.frame(Substrate = factor(c("Sand","Coarse substrate","Mixed sediment","Rock & boulders","Mud","Sandy Mud","Muddy Sand"), levels = c("Sand","Coarse substrate","Mixed sediment","Rock & boulders","Mud","Sandy Mud","Muddy Sand")), t(substrate_summary))
names(substrate_summary) = c("Substrate",'Mean','SE')

p4 <- ggplot(substrate_summary, 
             aes(x = Substrate, y = Mean)) +
      geom_pointrange(aes(ymin = Mean - SE, ymax = Mean + SE))+
  scale_x_discrete(name = paste0('Substrate (',round(c(ggInfluence(brt_review_block1, col.bar = "black"))[[1]][4], digits= 1),'%)'), labels = substrate_summary$Substrate)+
    scale_y_continuous(name="")+
  theme(axis.text.x = element_text(color="black", 
                           size=11, angle=45, vjust = 1, hjust = 1))

Figure 5.14:

p4 

Figure 5.15:

p5 <- ggPD_boot(brt_review_block1,
                predictor = 'SST_min',
                list.4.preds=brt_occ.prerun,
                booted.preds=brt_occ.boot$function.preds, 
                n.plots = 1, rug = T,
                cex.line=1, alpha.dot=0.4, type.ci = "ribbon", alpha.ci=0.4,
                col.ci = "#bdbdbd", col.line = "#636363",
                x.label = "SST min", y.label = "")

Figure 5.16:

grid.arrange(p1,p2,p3,p4,ncol = 4)

Figure 5.17:

5.8 Interactions between variables

# Interaction sizes
ggInteract_list(brt_review_block1,index = T)
##   var1.index      var1.names var2.index      var2.names int.size
## 1          7 WaveHeight_mean          6         TideAmp    88.63
## 2          7 WaveHeight_mean          4    StratifIndex    47.43
## 3          6         TideAmp          3         SST_min    35.07
## 4         12       Substrate          7 WaveHeight_mean    27.90
## 5          7 WaveHeight_mean          3         SST_min    14.62
## 6          7 WaveHeight_mean          5       WindSpeed     9.35
## 7         12       Substrate          3         SST_min     2.79
# Color palette
pal <- colorRampPalette(rev(brewer.pal(11, 'RdYlBu')))(100)

# Interaction WaveHeight_mean & TideAmp
ggInteract_3D(brt_review_block1, x = "WaveHeight_mean", y = "TideAmp", col.gradient = pal)
## maximum value =  0.65

Figure 5.18:

ggInteract_2D(gbm.object = brt_review_block1, x = "WaveHeight_mean", y = "TideAmp",
              show.dot = T, col.dot = "grey20", alpha.dot = 0.5, cex.dot = 0.2,
              label.contour = T, show.axis = T, legend = T, contour = T,
              smooth = "model", binwidth = 0.1, col.contour = "black") +
  scale_fill_distiller(palette = 'RdYlBu')
## maximum value =  0.7

Figure 5.19:

# Interaction WaveHeight_mean & StratifIndex
ggInteract_3D(brt_review_block1, x = "WaveHeight_mean", y = "StratifIndex", col.gradient = pal)
## maximum value =  0.67

Figure 5.20:

ggInteract_2D(gbm.object = brt_review_block1, x = "WaveHeight_mean", y = "StratifIndex",
              show.dot = T, col.dot = "grey20", alpha.dot = 0.5, cex.dot = 0.2,
              label.contour = T, show.axis = T, legend = T, contour = T,
              smooth = "model", binwidth = 0.1, col.contour = "black") +
  scale_fill_distiller(palette = 'RdYlBu')
## maximum value =  0.72

Figure 5.21:

# Interaction TideAmp & SST_min
ggInteract_3D(brt_review_block1, x = "TideAmp", y = "SST_min", col.gradient = pal)
## maximum value =  0.61

Figure 5.22:

ggInteract_2D(gbm.object = brt_review_block1, x = "TideAmp", y = "SST_min",
              show.dot = T, col.dot = "grey20", alpha.dot = 0.5, cex.dot = 0.2,
              label.contour = T, show.axis = T, legend = T, contour = T,
              smooth = "model", binwidth = 0.1, col.contour = "black") +
  scale_fill_distiller(palette = 'RdYlBu')
## maximum value =  0.63

Figure 5.23:

# Interaction Substrate & WaveHeight_mean
ggInteract_2D(gbm.object = brt_review_block1, x =  "Substrate", y =  "WaveHeight_mean") + 
  scale_color_brewer(palette = "Dark2",labels=c("Sand","Coarse substrate","Mixed sediment","Rock & boulders","Mud","Sandy Mud","Muddy Sand")) +
  scale_linetype_discrete(labels=c("Sand","Coarse substrate","Mixed sediment","Rock & boulders","Mud","Sandy Mud","Muddy Sand"))
## maximum value =  0.87

Figure 5.24:

# Interaction WaveHeight_mean & SST_min 
ggInteract_3D(brt_review_block1, x = "WaveHeight_mean", y = "SST_min", col.gradient = pal)
## maximum value =  0.65

Figure 5.25:

ggInteract_2D(gbm.object = brt_review_block1, x = "WaveHeight_mean", y = "SST_min",
              show.dot = T, col.dot = "grey20", alpha.dot = 0.5, cex.dot = 0.2,
              label.contour = T, show.axis = T, legend = T, contour = T,
              smooth = "model", binwidth = 0.1, col.contour = "black") +
  scale_fill_distiller(palette = 'RdYlBu')
## maximum value =  0.71

Figure 5.26:

# Interaction WaveHeight_mean & WindSpeed 
ggInteract_3D(brt_review_block1, x = "WaveHeight_mean", y = "WindSpeed", col.gradient = pal)
## maximum value =  0.65

Figure 5.27:

ggInteract_2D(gbm.object = brt_review_block1, x = "WaveHeight_mean", y = "WindSpeed",
              show.dot = T, col.dot = "grey20", alpha.dot = 0.5, cex.dot = 0.2,
              label.contour = T, show.axis = T, legend = T, contour = T,
              smooth = "model", binwidth = 0.1, col.contour = "black") +
  scale_fill_distiller(palette = 'RdYlBu')
## maximum value =  0.7

Figure 5.28:

5.9 Spatial predictions

load(file.path(input_dir,'IE_EnvRasterStack_2.75km.RData'))

rat <- (levels(raStack[['Substrate']]))[[1]]
# Recode substrate levels; numeric codes map to: 2 = Sand, 3 = Coarse substrate,
# 4 = Mixed sediment, 5 = Rock & boulders, 11 = Mud, 12 = Sandy Mud, 13 = Muddy Sand
rat$type <- c("2", "3", "4", "5", "11", "12", "13")
levels(raStack[['Substrate']]) <- rat

raStackDF <- as.data.frame(raStack, xy = T)
raStackDF$Substrate <- as.factor(raStackDF$Substrate_type)
save(raStackDF, file = file.path(input_dir,"raStackDF.Rdata"))

# Prediction as probabilities
pred_brt <- predict(raStack, brt_review_block1,  n.trees=brt_review_block1$gbm.call$best.trees, type="response")

par(mfrow = c(1,1))
#plot(pred_brt, main =  'BRT spatial predictions (Proba. of presence)')

# Prediction as probabilities from data.frame
raStackDF <- raStackDF[!(is.na(raStackDF$Substrate) | is.na(raStackDF$Aspect) ),]

sp_brt_pred = predict(brt_review_block1, newdata = raStackDF,n.trees=brt_review_block1$gbm.call$best.trees, type = 'response')

mat <- data.frame(pred= sp_brt_pred, Longitude= as.factor(raStackDF[,1]), Latitude= as.factor(raStackDF[,2]))
mat<- with(mat, tapply(pred, list(Longitude, Latitude), sum))

coastline <- rgdal::readOGR(file.path('coastlinesNOAA/','extract_coastline.shp'))
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/mpmarzlo/Documents/Collaborations/201906_SabellariaIreland(LouiseFIRTH)/SDM_SupMat_Nov2020/coastlinesNOAA/extract_coastline.shp", layer: "extract_coastline"
## with 51820 features
## It has 6 fields
par(mfrow = c(1,2))
cex_axis = 1.2
filled.contour(as.numeric(levels(as.factor(raStackDF[,1]))),
               as.numeric(levels(as.factor(raStackDF[,2]))),
               mat, color.palette= colorRampPalette(c("blue","khaki","red"), bias=1, space = "Lab"),
               zlim= c(0,1),nlevels= 50,
               xlab=NULL,ylab=NULL,
               cex=1.5,
                main = 'BRT predictions (proba.) and observed occurrence (green points)',
               plot.axes={
                 axis(1, cex.axis = cex_axis); axis(2, cex.axis= cex_axis);
                            cex.axis = cex_axis;
                plot(coastline, add = T);
                points(sab_env[which(sab_env$occurrence==1),]$Longitude,sab_env[which(sab_env$occurrence==1),]$Latitude, pch= 19,cex=1.2,col='darkgreen' )})

Figure 5.29:

# Prediction as presence/absence
threshold <- as.numeric(optimal.thresholds(
  DATA  = data.frame(1:nrow(sab_env), 
                     sab_env$occurrence,
             predict(brt_review_block1, newdata = sab_env, n.trees=brt_review_block1$gbm.call$best.trees, type="response")) ,
  opt.methods = 2))[2]
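In `PresenceAbsence::optimal.thresholds`, `opt.methods = 2` corresponds to the "Sens=Spec" criterion, i.e. the threshold \(t^{*}\) at which sensitivity and specificity are closest:

```latex
t^{*} \;=\; \underset{t}{\arg\min} \; \bigl\lvert \mathrm{Se}(t) - \mathrm{Sp}(t) \bigr\rvert
```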


# Discretise predictions into presences and absences
p <- pred_brt
p[p < threshold ]  <- 0
p[p >= threshold ] <- 1

#plot(p, xlim = c(-10.7,-5), zlim = c(0,1), main = paste("BRT predictions, threshold set to", threshold))
print(paste("Threshold set to", threshold, "for BRT presence/absence predictions."))
## [1] "Threshold set to 0.47 for BRT presence/absence predictions."
# Discretise data.frame-based predictions into presences and absences

p <- sp_brt_pred
p[p < threshold ]  <- 0
p[p >= threshold ] <- 1

mat <- data.frame(pred= p, Longitude= as.factor(raStackDF[,1]), Latitude= as.factor(raStackDF[,2]))
mat<- with(mat, tapply(pred, list(Longitude, Latitude), sum))
filled.contour(as.numeric(levels(as.factor(raStackDF[,1]))),
               as.numeric(levels(as.factor(raStackDF[,2]))),
               mat, color.palette= colorRampPalette(c("blue","khaki","red"), bias=1, space = "Lab"),
               zlim= c(0,1),nlevels= 50,
               xlab=NULL,ylab=NULL,
               cex=1.5,
                main = 'BRT predictions (Pres/Abs) and observed occurrence (green points)',
               
               plot.axes={
                 axis(1, cex.axis = cex_axis); axis(2, cex.axis= cex_axis);
                            cex.axis = cex_axis;
                plot(coastline, add = T);
                points(sab_env[which(sab_env$occurrence==1),]$Longitude,sab_env[which(sab_env$occurrence==1),]$Latitude, pch= 19,cex=1.2,col='darkgreen' )})

Figure 5.30:


6 Session info

sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] splines   grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] gstat_2.0-5             lattice_0.20-38        
##  [3] kernlab_0.9-27          rpart_4.1-15           
##  [5] RSNNS_0.4-12            Rcpp_1.0.3             
##  [7] rJava_0.9-11            earth_5.1.2            
##  [9] plotmo_3.5.6            TeachingDemos_2.10     
## [11] plotrix_3.7-6           Formula_1.2-3          
## [13] mgcv_1.8-28             nlme_3.1-140           
## [15] mda_0.4-10              class_7.3-15           
## [17] ggpointdensity_0.1.0    RColorBrewer_1.1-2     
## [19] rgdal_1.4-3             fields_9.8-1           
## [21] spam_2.2-2              dotCall64_1.0-0        
## [23] mapdata_2.3.0           maps_3.3.0             
## [25] forcats_0.4.0           stringr_1.4.0          
## [27] readr_1.3.1             tidyr_0.8.3            
## [29] tibble_2.1.1            tidyverse_1.2.1        
## [31] scales_1.0.0            viridis_0.5.1          
## [33] viridisLite_0.3.0       ggridges_0.5.1         
## [35] ggcorrplot_0.1.3        sf_0.8-1               
## [37] ncf_1.2-9               automap_1.0-14         
## [39] blockCV_2.1.1           PresenceAbsence_1.1.9  
## [41] ROCR_1.0-7              gplots_3.0.1.1         
## [43] randomForest_4.6-14     gbm_2.1.5              
## [45] ggBRT_0.0.0.9000        reshape2_1.4.3         
## [47] plotly_4.9.0            gridExtra_2.3          
## [49] ggthemes_4.2.0          ggplot2_3.3.0          
## [51] plyr_1.8.4              dismo_1.1-4            
## [53] directlabels_2018.05.22 sdm_1.0-67             
## [55] usdm_1.1-18             raster_3.0-7           
## [57] sp_1.3-2                readxl_1.3.1           
## [59] magrittr_1.5            purrr_0.3.2            
## [61] dplyr_0.8.1            
## 
## loaded via a namespace (and not attached):
##  [1] backports_1.1.4    lazyeval_0.2.2     crosstalk_1.0.0   
##  [4] listenv_0.8.0      digest_0.6.18      htmltools_0.3.6   
##  [7] fansi_0.4.0        gdata_2.18.0       globals_0.12.5    
## [10] modelr_0.1.4       xts_0.11-2         rmdformats_0.3.5  
## [13] prettyunits_1.0.2  colorspace_1.4-1   rvest_0.3.4       
## [16] haven_2.1.0        xfun_0.7           crayon_1.3.4      
## [19] jsonlite_1.6       zeallot_0.1.0      survival_2.44-1.1 
## [22] zoo_1.8-5          glue_1.3.1         gtable_0.3.0      
## [25] questionr_0.7.0    future.apply_1.5.0 DBI_1.0.0         
## [28] miniUI_0.1.1.1     isoband_0.2.0      xtable_1.8-4      
## [31] progress_1.2.2     units_0.6-3        intervals_0.15.2  
## [34] htmlwidgets_1.3    httr_1.4.0         FNN_1.1.3         
## [37] pkgconfig_2.0.3    reshape_0.8.8      utf8_1.1.4        
## [40] tidyselect_0.2.5   labeling_0.3       rlang_0.4.2       
## [43] later_0.8.0        munsell_0.5.0      cellranger_1.1.0  
## [46] tools_3.5.3        cli_1.1.0          generics_0.0.2    
## [49] broom_0.5.2        evaluate_0.13      yaml_2.2.0        
## [52] knitr_1.23         caTools_1.17.1.2   future_1.17.0     
## [55] mime_0.6           xml2_1.2.0         compiler_3.5.3    
## [58] rstudioapi_0.10    e1071_1.7-1        spacetime_1.2-3   
## [61] stringi_1.4.3      highr_0.8          rgeos_0.5-1       
## [64] Matrix_1.2-17      classInt_0.4-3     vctrs_0.1.0       
## [67] pillar_1.4.0       cowplot_1.0.0      data.table_1.12.2 
## [70] bitops_1.0-6       httpuv_1.5.1       R6_2.4.1          
## [73] bookdown_0.11      promises_1.0.1     KernSmooth_2.23-15
## [76] codetools_0.2-16   gtools_3.8.1       assertthat_0.2.1  
## [79] withr_2.1.2        parallel_3.5.3     hms_0.4.2         
## [82] quadprog_1.5-8     rmarkdown_1.13     shiny_1.3.2       
## [85] lubridate_1.7.4

A. Boyé & M. Marzloff

10 December, 2020